Whether geometric verification of point matches for absolute pose, homography estimation or normal estimation in point clouds, outlier rejection is a key ingredient in numerous applications in computer vision. RANSAC [FB81] is a common choice for the task as it has good quality/runtime tradeoffs for many practical problems. RANSAC however doesn’t provide an optimality guarantee, requires a lower bound on the inlier ratio and its runtime increases exponentially with the outlier ratio making it unusable for problems with few inliers. In contrast Branch&Bound methods [B03, CPKH20] are guaranteed to find the optimal solution, yet their worst-case runtime equals exhaustive search in parameter space. Their practical use depends on the quality of the bounds which are problem specific and notoriously hard to find.
We propose an alternative approach with linear runtime independent of the inlier ratio and that doesn’t require a lower bound on the fraction of inliers. We base our work on the efficient general algorithm for solving geometric incidence problems proposed in [AKKSZ19]. In a nutshell the authors show that voting [ZSP15] is equivalent to finding a point of maximum depth, i.e. to find a point in which is close to as many surfaces (and thus inliers) as possible among a given set of -surfaces in . We demonstrate an effective solution to outlier removal problems by transforming them to the surface incidence framework of [AKKSZ19]. Our contributions:
We introduce the concept of general voting and its relation to the approximate incidences of [AKKSZ19] for outlier removal to a wider computer vision audience and demonstrate its use for camera posing, ray intersection and geometric model fitting.
All approaches have linear complexity and are therefore applicable to very large problems, which are infeasible to solve with RANSAC [FB81]. The worst case complexity (and performance in practice) is always better than other methods (voting [HOU62] and B&B [HD60]) for sufficiently large inputs.
We demonstrate the scalability and versatility using a generalization of [ZSP15, SEKO16] where we remove the requirement on a calibrated camera and known gravity direction, but instead solve for these unknowns.
We compare to the state-of-the-art [CPKH20] for 6 degree-of-freedom (6DoF) camera pose estimation and show that our approach is optimal and faster in practice. In comparison to [CPKH20], our surface intersection algorithm provides a tight upper bound on the score without problem-specific knowledge.
We provide example-derivations and open-source code which serve as a tutorial for applying [AKKSZ19] to a range of outlier removal problems in computer vision.
2 The Family of General Alignment Problems
A large number of geometric vision problems can be viewed as general alignment problems where we aim to bring items in one set "close" to items in another set by applying a transformation from a group of allowed transformations. This closeness can be defined by some given parameter, and it is common to use the Hausdorff [HK90] metric, the distance between items in and w.r.t. .
This alignment problem is at the core of many outlier rejection problems (e.g. the ones discussed initially). There we aim to bring a maximum number of items in to be at Hausdorff distance at most from some item in : For example in structure-based localization [LSHF12, SMTTHSSOPS18] is a set of rays in -space in the camera frame that we want to align with a set of d world points . In relative camera posing [FLOEK16, FL16] we aim to align two sets of -rays such that the maximum number of pairs approximately intersect -close. In all of these problems, one defines the objects and the allowed transformation group and then solves an approximate geometric incidence problem.
Any hypothetical match — a correspondence between an object in and an object in — defines a general surface of some dimension which is embedded in the ambient -space of transformations. Therefore, we can solve the inlier set maximization in linear time in the number of hypothetical matches by some kind of voting scheme: surfaces are intersected in the -space to locate the point closest to as many surfaces as possible. This point of maximum incidences represents the solution of the alignment problem.
3 Proximity and Incidences using Surfaces
Our work is based on the findings of [AKKSZ19] and we introduce their approach and notation using 2D line fitting as a toy example. The authors formulate the maximum incidence problem as one of reporting points in that are close to the majority of -dimensional surfaces. Each surface is given in parametric form where the first coordinates are the surface parameters. Specifically, each is defined in terms of essential parameters , and additional free additive parameters , one free parameter for each dependent coordinate. The surface is parameterized by and (we then denote as ) and defined by
W.l.o.g. the voting space is scaled to be the unit cube to simplify runtime complexity expressions.
For our toy example, we are given a set of points in 2D to which we want to fit a 2D line. Using a standard point-line duality [EA57], each point is transformed to a line and each line to a point, such that the vertical distance between each point-line pair is preserved. The dimensional surface is embedded in the line parameter space and describes all lines that intersect the point. The (single) surface equation in this case is
where the slope of a line is the essential parameter and the offset is the additive free parameter . Consequently, the point in with the largest number of close surfaces corresponds to the line parameters that fit the input points best (i.e. with maximal number of incidences or inliers).
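The toy example can be made concrete with a minimal sketch. It assumes one common form of the duality, in which a point (px, py) maps to the dual line b = py - m*px in the (m, b) voting space; this sign convention and the helper names `dual_offset` and `count_inliers` are illustrative assumptions, not notation from [AKKSZ19].

```python
def dual_offset(point, m):
    """Offset b of the line with slope m that passes through `point`."""
    px, py = point
    return py - m * px


def count_inliers(points, m, b, eps):
    """Count dual surfaces within vertical distance eps of the point (m, b)."""
    return sum(abs(dual_offset(p, m) - b) <= eps for p in points)


# Three points on y = 2x + 1 and one point off the line.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (0.5, 0.1)]
print(count_inliers(points, m=2.0, b=1.0, eps=1e-6))
```

The voting-space point (2, 1) is close to three of the four dual lines, which corresponds to three inliers of the primal line y = 2x + 1.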
3.1 A Naive Voting Solution
Once we formulate a problem as a general surface consensus problem, we obtain a naive algorithm: Iterate the first dimensions of the voting space on an -grid and compute the dependent variables for each surface for an neighborhood box around each grid vertex. This makes sure that for each vertex we collect all nearby surfaces at most apart from each other. Then it uses the dependent variable range to cast votes in the voting space (Algorithm 1).
For higher dimensional spaces this enumeration becomes costly, since the algorithm is independent of the actual structure of the surfaces and always takes time for -dimensional surfaces.
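For the 2D line fitting toy example, naive voting can be sketched as follows. This is a simplified illustration rather than a transcription of Algorithm 1: it assumes the dual surface b = py - m*px, a voting space bounded by the unit square, and the hypothetical function name `naive_vote`.

```python
import math
from collections import Counter


def naive_vote(points, eps, m_range=(0.0, 1.0), b_range=(0.0, 1.0)):
    """Return ((m, b), votes) for the grid cell with the most votes."""
    votes = Counter()
    n_m = int(round((m_range[1] - m_range[0]) / eps))
    n_b = int(round((b_range[1] - b_range[0]) / eps))
    for i in range(n_m):  # enumerate the slope axis on an eps-grid
        m_lo = m_range[0] + i * eps
        m_hi = m_lo + eps
        for px, py in points:
            # b = py - m*px is monotone in m, so its range over the slope
            # cell is spanned by the values at the two cell boundaries.
            b0, b1 = py - m_lo * px, py - m_hi * px
            j_lo = max(0, math.floor((min(b0, b1) - b_range[0]) / eps))
            j_hi = min(n_b - 1, math.floor((max(b0, b1) - b_range[0]) / eps))
            for j in range(j_lo, j_hi + 1):  # cast one vote per touched cell
                votes[(i, j)] += 1
    if not votes:
        return None, 0
    (i, j), count = votes.most_common(1)[0]
    centre = (m_range[0] + (i + 0.5) * eps, b_range[0] + (j + 0.5) * eps)
    return centre, count
```

Every surface is evaluated once per grid step of the enumerated axis, regardless of where the surfaces actually lie, which is the cost the text refers to.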
3.2 Efficient Canonized Generalized Voting
Algorithm 2 is our generalized voting procedure based on the work in [AKKSZ19]. We propose a coarse-to-fine scheme that decomposes the search space to significantly improve the runtime complexity. Our approach rounds surfaces so that we can group similar surfaces for joint processing, without affecting the outcome of the computation. This canonization means that in every recursive step of the algorithm (similar to levels in an octree decomposition) there are approximately the same number of surfaces to process. The consequence is a worst case runtime complexity of (see Theorem 4.4 in [AKKSZ19]) or ignoring log factors. For sufficiently large generalized voting is thus always asymptotically faster because of the multiplicative influence of on the approximation cost in naive voting.
For simplicity, we refer to a general parameter in the text. For guaranteed approximation as in [AKKSZ19] one needs to update slightly in order to compensate for accumulated error during the canonization and ensure that no close surface is missed. In practice we use different for each coordinate and tune the required values experimentally.
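The coarse-to-fine decomposition can be illustrated on the 2D line fitting example. The sketch below is deliberately incomplete: it recursively subdivides the voting space and keeps only the surfaces crossing each box, but it omits the canonization step, so it does not carry the runtime guarantee of Algorithm 2. The dual surface b = py - m*px and the names `crosses` and `coarse_to_fine` are assumptions made for this illustration.

```python
def crosses(point, box):
    """True if the dual line b = py - m*px meets box = (m_lo, m_hi, b_lo, b_hi)."""
    px, py = point
    m_lo, m_hi, b_lo, b_hi = box
    b0, b1 = py - m_lo * px, py - m_hi * px
    return min(b0, b1) <= b_hi and max(b0, b1) >= b_lo


def coarse_to_fine(points, box, eps):
    """Return (votes, (m, b)) for the best leaf box with side length <= eps."""
    alive = [p for p in points if crosses(p, box)]  # discard far surfaces
    if not alive:
        return 0, None
    m_lo, m_hi, b_lo, b_hi = box
    if m_hi - m_lo <= eps and b_hi - b_lo <= eps:
        return len(alive), (0.5 * (m_lo + m_hi), 0.5 * (b_lo + b_hi))
    m_mid, b_mid = 0.5 * (m_lo + m_hi), 0.5 * (b_lo + b_hi)
    children = [
        (m_lo, m_mid, b_lo, b_mid), (m_lo, m_mid, b_mid, b_hi),
        (m_mid, m_hi, b_lo, b_mid), (m_mid, m_hi, b_mid, b_hi),
    ]
    best = (0, None)
    for child in children:
        # Only the surfaces alive in the parent can cross a child box.
        candidate = coarse_to_fine(alive, child, eps)
        if candidate[0] > best[0]:
            best = candidate
    return best
```

In the full method, the surfaces passed to each child would first be rounded and grouped, which is what bounds the number of surfaces processed per recursion level.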
3.2.1 Algorithm and Implementation
Given a general outlier removal problem, the first step is to formulate the problem as a general surface with a parametric representation as in section 3.
From constraints to general voting. We start with a set of constraints, each of which implicitly defines a surface embedded in (by the points that satisfy the constraint). The surface may have only dimensions (where ), meaning that w.l.o.g. given the other values, we can compute the point as a function of variables. For 2D line fitting (Eq. (2)) every 2D point defines a line in voting space (with , we have a dimensional surface embedded in the 2D ambient space). To use our framework, we have to provide two functions: (1) A predicate that returns whether a given surface intersects a given box in ; (2) A function that computes the dependent variables, given the other variables. In our example of 2D line fitting (1) computes whether intersects the given box and (2) is .
Canonization. Most Branch&Bound algorithms subdivide the parameter search, discarding branches early based on computed scores. A common such subdivision is the octree. A key ingredient that makes our generalized voting algorithm efficient is Canonization; a surface rounding process we apply before recursing to the next level of the octree. Surfaces which are close to each other in the current box are rounded and grouped into the same surface, thus bounding the overall number of surfaces.
For a surface and its rounded version we have, for each ,
where is some parameter that depends on the approximation. The canonization approximates the input set, but significantly reduces the number of surfaces to process and thus the runtime. During canonization we keep track of merged surfaces to recover the original surfaces (the set of inliers) after finding the maximum.
For each surface we first translate the surface such that the minimum box vertex coincides with the parameter space origin. Recalling the surface definition from Section 3, we round each free parameter to the closest integral multiple of and each essential parameter to the nearest integer multiple of where is the diameter of the current box and is a constant depending on the given approximation parameter and the surface.
Figure 1 shows the rounding process for the case of a 2D surface given by and a box . Note that our goal is to find a surface which is close to the input surface within the box . We first round the essential parameter to which moves the line away from the box, so we translate it by changing and then rounding it to . The key is that the number of canonical surfaces in is upper bounded independent of the number of input surfaces.
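The grouping effect of canonization can be sketched for lines in a 2D voting space. The sketch assumes each surface is a line b = a*m + c with essential parameter a and free additive parameter c, and it takes the two rounding steps `step_a` and `step_c` as plain inputs; these stand in for the step sizes derived in [AKKSZ19] and, like the function name `canonize`, are hypothetical.

```python
from collections import defaultdict


def canonize(lines, box_min, step_a, step_c):
    """Group lines b = a*m + c, given as (a, c), by their rounded parameters.

    Returns a dict mapping each canonical key to the indices of the input
    lines merged into it, so the original inliers can be recovered later.
    """
    m0, b0 = box_min
    groups = defaultdict(list)
    for idx, (a, c) in enumerate(lines):
        # Translate so that the minimum box vertex becomes the origin:
        # with m' = m - m0 and b' = b - b0 the line reads b' = a*m' + c_local.
        c_local = a * m0 + c - b0
        key = (round(a / step_a), round(c_local / step_c))
        groups[key].append(idx)
    return groups


lines = [(-0.50, 0.30), (-0.501, 0.3004), (0.2, 0.9)]
groups = canonize(lines, box_min=(0.0, 0.0), step_a=0.01, step_c=0.01)
print(sorted(groups.values()))
```

The first two lines differ by less than the rounding steps and collapse into one canonical surface, while the third stays separate. The number of possible keys inside a box depends only on the step sizes, not on the number of input lines.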
Canonization accumulates error with each rounding (bounded by octree levels). To guarantee approximation errors we adjust the original to ( is a constant depending on the surface, see [AKKSZ19]). is a constant depending on the bounded and Lipschitz gradients of the surfaces, which can be estimated analytically [WZ96].
Estimating in practice. determines the runtime upper bound and the best parameter can vary among coordinates such that in practice we tune constants per parameter (different for and in the line fitting example).
Surface-Box intersection. There exists a general algorithm for any surface and box: Suppose the surface is given as a polynomial equation and that the box is . For simplicity let us focus on , so the surface is . If we assume is connected, then intersects if either (1) is fully contained in , or (2) it intersects some face, or (3) it intersects some edge. To test for case (1), pick an arbitrary point on and test if it lies in . If is not connected, repeat this for one point in each connected component. Cases (2) and (3) are handled recursively: for (2), we handle each face so that intersect with a 2D plane, where the polynomial is . For (3) we have a univariate polynomial and we need to test if it has a root in .
The above general algorithm can be slow in practice for many dimensions but we found that using a subset of the conditions is a good approximation in practice.
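For the 2D line fitting example the predicate reduces to an interval overlap test, which gives a feel for what a problem-specific implementation looks like. The sketch assumes the dual surface b = py - m*px and an axis-aligned box (m_lo, m_hi, b_lo, b_hi); it is specific to this linear case and is not the general polynomial algorithm described above. The name `line_intersects_box` is hypothetical.

```python
def line_intersects_box(point, box):
    """True if the dual line b = py - m*px crosses the axis-aligned box."""
    px, py = point
    m_lo, m_hi, b_lo, b_hi = box
    # The surface is linear in m, so over [m_lo, m_hi] it covers exactly
    # the interval between its values at the two slope boundaries.
    b0, b1 = py - m_lo * px, py - m_hi * px
    return min(b0, b1) <= b_hi and max(b0, b1) >= b_lo


print(line_intersects_box((1.0, 1.0), (0.0, 0.5, 0.6, 0.8)))  # line b = 1 - m
print(line_intersects_box((1.0, 1.0), (0.0, 0.5, 0.0, 0.4)))
```

For m in [0, 0.5] the line b = 1 - m covers b in [0.5, 1], so the first box is hit and the second is not.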
3.3 Theoretical Analysis of Related Work
To the best of our knowledge, three alternatives to our algorithm exist, and we compare their asymptotic runtimes with ours.
RANSAC [FB81] is the most common approach to solve outlier rejection problems. The RANSAC complexity is for a lower bound on the fraction of inliers, a minimal set size needed to define a possible solution and input constraints. This is linear in only when is a constant and even then it grows quickly when is decreasing, making RANSAC unattractive for large scale problems.
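The dependence on the inlier fraction can be made tangible with the textbook RANSAC iteration-count estimate N = log(1 - p) / log(1 - w^k), where w is the inlier fraction, k the minimal set size and p the desired probability of drawing at least one all-inlier sample. The success probability p and the function name `ransac_iterations` are assumptions of this sketch and are not taken from the complexity expression above.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, success_prob=0.99):
    """Samples needed to hit one all-inlier minimal set with success_prob."""
    good = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if good <= 0.0:
        return math.inf
    if good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - good))


for ratio in (0.5, 0.1, 0.01):
    print(ratio, ransac_iterations(ratio, sample_size=3))
```

Each iteration additionally scores the hypothesis against all input constraints, so the total work is the iteration count multiplied by the input size.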
Hough-based methods [HOU62, DH72] exist in many variants, but all share a term with polynomial runtime complexity in which all sets of points (each of minimal size ) are traversed. For each subset we vote for the parameters defined by the subset, taking time to find the parameters with the maximum number of votes. Alternatively, for each single point one can enumerate all bins in the Hough space corresponding to parameters of models passing through (or close to) . This takes where is the dimension of the Hough surface. A randomized version of Hough voting is applicable if a lower bound on the number of inliers exists, though it introduces the same limitation as RANSAC.
Branch&Bound [HD60] methods have been considered and implemented for a large number of optimization problems, including the family we consider here [B03, LH07, OKO09, FLOEK16, FL16]. They are optimal up to an error bound defined by the smallest box that terminates the process (say of size assuming w.l.o.g. that the space is ). The practical runtime of Branch&Bound can be low but is highly dependent on the structure of the surfaces and the quality of the bounds which are notoriously hard to find. The worst case runtime is (ignoring the log factor), where depends on the dimension of the problem, and is thus similar to the naive enumeration of cells. Recently, the 6DOF posing problem was solved without correspondences using B&B [CPKH20], against which we evaluate in Section 5 and in the supplemental material. The worst case runtime presented in [CPKH20] is for 3d points, frame key points and are tolerance parameters similar to our . This aligns with the naive method, as is the number of matches in the problem (though it can be much faster in practice).
Our generalized voting approach is asymptotically and in practice faster and more general than these alternatives while being easy to implement. Existing B&B algorithms can be combined with the proposed canonization, to leverage high quality bounds in combination with a guaranteed worst case runtime.
In the following, we enumerate a list of typical outlier rejection problems in geometric computer vision and we show how each one of them can be reduced to the incidence problem and solved efficiently using our method. We deliberately do not compare to the numerous state of the art methods across all applications, but rather want to emphasize the generality of our approach.
4.1 Fitting Hyperplanes in -space
Model fitting is a well-investigated problem in the literature [SS01, DH72, HOU62] and is commonly solved using RANSAC or Hough voting. It is also mentioned as an example in [AKKSZ19] and solved using a primal-dual method. The goal is to report the hyperplane consistent with most points (distance).
In order to solve this using our general method, we first use the point-hyperplane duality [EA57]. Points are transformed to hyperplanes and hyperplanes to points, with the vertical distance between each plane-point pair being preserved. If we keep hyperplanes in an appropriate orientation, this algebraic distance (the coordinate) is a good approximation to the Euclidean distance we want to minimize. We then search for the point that is consistent with the maximum number of hyperplanes (=surfaces) and transform it back via duality to obtain the best fitting hyperplane. For example, the parameterization in 3D is the standard plane equation:
with in and the runtime is compared to in the naive method. The -dimensional algorithm is then better for any (for any fixed ).
We evaluate the runtime for the 2D case by comparing B&B and RANSAC to our method, where we implemented the problem-specific surface definition and the intersection predicate. The test data consists of uniformly sampled outlier points in the unit cube and inlier points sampled along a fixed line with additive noise. We apply this for both an increasing number of points and an increasing fraction of outliers (which we provide to RANSAC to estimate the iteration count). As can be seen in Fig. 2, for small inlier fractions RANSAC becomes slow and our method outperforms it due to the better asymptotic runtime. Results for B&B represent the worst case due to the uniformity of the input data. In all cases, accurate line parameters were found using .
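Test data of the kind described above could be generated as in the following sketch. The line parameters, the uniform noise model, its magnitude and the function name `make_line_data` are assumptions chosen for illustration; the exact settings used in the evaluation are not stated here.

```python
import random


def make_line_data(n_points, outlier_fraction, slope=0.5, offset=0.25,
                   noise=0.005, seed=0):
    """Return (inliers, outliers) as lists of 2D points in the unit square."""
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    n_out = int(round(n_points * outlier_fraction))
    n_in = n_points - n_out
    inliers = []
    for _ in range(n_in):
        x = rng.random()
        # Inliers lie on the fixed line, perturbed by bounded additive noise.
        inliers.append((x, slope * x + offset + rng.uniform(-noise, noise)))
    # Outliers are sampled uniformly and independently of the line.
    outliers = [(rng.random(), rng.random()) for _ in range(n_out)]
    return inliers, outliers


inliers, outliers = make_line_data(n_points=100, outlier_fraction=0.9)
print(len(inliers), len(outliers))
```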
4.2 Absolute Camera Posing Problems
The authors of [AKKSZ19] show how to define the surfaces for a calibrated camera with known gravity direction which is a setup with 4 DoF that is also used in [ZSP15, SEKO16]. In the following we derive formulations for camera setups with higher DoF that are difficult to handle by a naive voting approach due to the higher dimensional solution space. Our generalized voting provides a good approximation of the best pose which can then be taken by a more accurate method (e.g. a minimal solver within RANSAC) that then operates on the remaining point set largely free from outliers.
Let be a point in and let be a point in the (normalized) image plane. The triple is a correspondence and we show how the constraints from the correspondence are reduced to a general surface . We provide runtime complexities of the following formulations in Table 1.
|Posing problem||Known focal||Known gravity||Generalized voting||Naive voting||Generalized faster than naive for|
|5DoF (radial camera)||-||-|
4.2.1 Unknown focal length, known gravity direction (5DoF)
We use with coordinates as our search space, where is the camera position, with denoting the (remaining) camera orientation around gravity, and is the focal length. Each such coordinate models a possible pose of the camera plus focal length. A correspondence is parameterized by the triple , and defines a -dimensional algebraic surface . It is the locus of all camera poses and focal lengths at which it sees at image coordinates . We can rewrite these equations into the following parametric representation of , expressing and as functions of , and .
Our goal is to find a tuple such that as many correspondences as possible are (approximately) consistent with it. In other words the ray from the camera center to goes approximately (i.e. up to some reprojection error) through in the image plane.
In the case of our -surfaces in -space, the parameter is free, and we introduce a second artificial free parameter into equation 4 for . The number of essential parameters is (they are ,,, and ). With and , using the general technique from [AKKSZ19], we obtain an algorithm that takes (for correspondences) . The naive method would take time so that the general technique is asymptotically better (ignoring polylogarithmic factors) for any .
4.2.2 Unknown focal length (7DoF)
We use as the -tuple of unknowns that we aim to solve for, where is a 3-vector describing the minimal rotation parameterization (e.g. angle-axis). Given the standard general projection matrix the world point projects to
where denote the row of the rotation matrix. This reveals constraints
which allows us to parameterize and as a function of for each correspondence according to
With the naive rendering takes while general voting takes which is better for any .
4.2.3 Calibrated camera (6DoF)
This is identical to the 7DoF case where we set . With , naive rendering takes while generalized voting takes as for the 7DoF case, which is better for any .
4.2.4 Unknown focal length using a radial camera model (5DoF)
The previous formulation for the 6DoF pose plus focal length case requires a -dimensional voting space. As an alternative we propose to leverage a radial camera model [TP12] which is known to work well for pose estimation with unknown focal length [BZT10, LSKP19]. The approach factors the pose estimation in two consecutive steps where the first relies on line-to-point correspondences and solves for all parameters except the focal length and the camera motion along the optical axis. A second upgrade step then solves for the remaining parameters using a least squares fit. For outlier removal we focus on the first step and therefore are looking for a -tuple . The radial camera projection matrix is
and we have , where are the rows of . We then obtain where we parameterize as a function of :
We have and the algorithm takes time, compared to with naive rendering. This makes the general algorithm better for any .
Localizing an uncalibrated camera with a known axis of rotation is a common problem in computer vision, and both the problem dimensionality and the number of unknowns are well suited to demonstrate the general use of our method. Therefore, we focus on the 5DoF problem described in Section 4.2.1 here and compare the runtimes of our generalized voting approach with naive voting, Branch&Bound, and RANSAC. Note that Section 5 presents a more in-depth evaluation, including localization performance, against the state-of-the-art (Branch&Bound) method for 6DoF camera posing on public datasets.
Our evaluation data exhibits real-world, large-scale scenes where a set of d points and potentially corresponding image points are given. We create them by matching a real query image against 3D points of a SfM model. Using 7, 14, and 56 candidates in the nearest neighbor search results in problem sizes of , , and matches.
For the solution computations we consider typical space limitations for all methods: Due to the nature of the parameterization we only consider a camera orientation with tangent in , rotating the scene accordingly if needed. We bound the spatial position to be at most m and the camera height m from the ground truth. The focal length is bounded within a fraction of the ground truth.
Figure 3 illustrates our results, which demonstrate that our method is the most efficient. In order to eliminate implementation details which can change runtimes considerably, we do not measure time but instead count the number of dominant operations. That is the number of d grid cells that are touched in the naive implementation and the number of calls to the surface-box intersection predicate for our algorithm and for B&B. For RANSAC we simplified the problem to one with known focal length and were thus able to use a 3-point minimal solver with early rejection to account for the known gravity direction. We also tuned the number of iterations and inlier tolerance to find a good pose with minimal runtime. As expected, the gap between the methods and the improvement of the proposed algorithm increase with the size of the input, as suggested by the asymptotic complexity.
4.3 Intersections of Rays/Lines in -space
The problem setup consists of a set of input rays in -space and a grid of cell size (within some bounding box). The goal is to report all subsets of rays intersecting any (non-empty) cell and with cardinality larger than some threshold. We can formulate the general surfaces using:
With the naive algorithm takes while the general voting takes which is better (ignoring logarithmic factors) for any .
4.4 Pointset Alignment
In this problem and are sets of points in the plane. Any hypothetical correspondence between and defines a surface in the -space of similarity transformations (translation, rotation and scale). The linear parametrization is:
where is the scale, is the rotation and is the translation vector. The general surface is then:
This is a -surface embedded in where are given as functions of . In this case we have two essential parameters () and we introduce two artificial new free parameters. With and with as the number of canonical surfaces the naive algorithm would take and the general voting takes which is better for any (ignoring logarithmic factors).
5 Comparison with GOPAC [CPKH20]
Solving the 6DOF camera posing problem without known correspondences is hard due to the large search space. Recently, the authors of [CPKH20] presented a solution using a globally optimal method based on Branch&Bound and conducted extensive evaluations on public datasets (Data61/2D3D [NNSP15] and Stanford 2D-3D-Semantics (2D-3D-S) [ASZS17]) against state-of-the-art alternatives.
We evaluate our general voting on this problem, viewing it as a Branch&Bound search. We compare our results with [CPKH20] on the same datasets, using the GOPAC code provided by the authors (re-run on our machine to ensure comparability to our approach), showing that our general algorithm is not only asymptotically better but also faster in practice. Both our algorithm and GOPAC are globally optimal up to a prescribed resolution and perform joint inlier set maximization and correspondence search. Our method achieves a significantly better worst case runtime of compared to in GOPAC.
In order to implement the 6DOF solver in our general framework we reformulated the problem as a surface consensus problem according to Sec. 4.2.3. Compared to GOPAC our formulation does not use an angular error metric, but is based on surface distances. However, it is possible to obtain the globally optimal solution by conservative expansion of the cubes (to avoid missing inliers) and verify the angular projection error on the final inlier set for the minimal cuboids.
Data61/2D3D [NNSP15] outdoor dataset:
The dataset consists of 88 3D points and 11 sets of 30 bearing vectors. Table 2 shows the localization performance and runtime for GOPAC, our generalized voting and RANSAC. Both Branch&Bound algorithms show comparable accuracy due to their global optimality and we only expect a difference in runtime. Here our algorithm is more than an order of magnitude faster. As in [CPKH20], we restrict the translation solution to a domain along the street (as it is known that cameras are mounted on a vehicle), and a camera is considered successfully posed if the rotation error is less than radians and the normalized translation error is less than .
|Translation Error (m)||2.76||2.89||28.5|
|Rotation Error ()||2.18||0.46||179|
|Success rate (inliers)||1||1||0|
|Success rate (pose)||0.82||0.82||0.09|
Stanford 2D-3D-Semantics [ASZS17] indoor dataset:
The dataset consists of 15 rooms of different types and 27 sets of 50 bearing vectors. Table 3 lists our performance and runtime comparison. As the experimental evaluation of [CPKH20] uses a GPU implementation we reran both algorithms on CPU using 8 threads to ensure comparability. In order to obtain reasonable runtimes we set to 0.5m and 0.1 radians, respectively for both methods. Again we expect a similar localization performance due to the global optimality of both methods. On average the runtime of our algorithm on this dataset is only slightly better. As noted in Section 3.2.1, the effect of the asymptotically better worst case runtime increases with the hardness of the problem (both, in size and data distribution). Therefore, Figure 4 depicts the (sorted) runtimes for all queries, which illustrates that our algorithm becomes significantly faster for the hard cases.
|Translation Error (m)||0.29||0.38|
|Rotation Error ()||3.46||2.81|
|Success rate (inliers)||1||1|
|Success rate (pose)||0.77||0.77|
|Success rate within 60s||0.19||0.42|
The supplementary material contains additional comparisons on large scale outdoor datasets.
In this paper we introduced the concept of general voting, a powerful method for outlier rejection applicable to a wide range of geometric computer vision problems. We adopt the previously proposed method of approximate incidences to solve for inlier maximization in multiple classical computer vision problems ranging from camera posing and ray intersection to geometric model fitting. We described the general recipe with a simple to understand 2d-line fitting example, but also demonstrated its applicability to real-world problems.
Through theoretical analysis and experiments we demonstrated that our algorithm scales better than its alternatives, like RANSAC or Branch&Bound, both in terms of complexity and real-world runtime. The experimental data validated that the proposed solution performs particularly well for large problems with low inlier ratios where alternative solutions require problem specific knowledge to remain applicable.
One of the key use-cases we investigated is camera posing with and without known gravity direction and focal length, problems that cannot be solved efficiently at large scale with previously published methods, yet that have wide applicability in industry. To solve these cases, we introduced two algorithms that are key to efficiency: canonization of the intersected surfaces and an efficient -box intersection algorithm which we combine in a spatial subdivision scheme. To demonstrate the impact of these contributions we provided an extensive evaluation against a recently published state-of-the-art method on publicly available, large-scale indoor and outdoor datasets.
Besides solving concrete localization problems, this work introduced the concept of general voting to the wider computer-vision community. We aim to provide a recipe for applying this approach to a range of problems and publish an open-source implementation of our efficient general voting to unlock new applications and research directions.
The authors would like to thank Micha Sharir for helpful discussions concerning the general surface-box intersection.